Academic Open Internet Journal

ISSN 1311-4360

www.acadjournal.com

Volume 19, 2006

 

 

Efficient and scalable partition based algorithms for mining association rules

 

Muthukumar A1

Nadarajan R2

1 Corresponding author,

Assistant Professor,

Department of Computer Applications,

Kumaraguru College of Technology, Coimbatore-641 006.

Phone: 91-9345166161

Fax: 91-422-2669406

amk_in_2000@yahoo.com

 

2 Dean & HOD,

Department of Mathematics & Computer Applications,

PSG College of Technology, Coimbatore-641 004.

Phone: 91-422-5344157,5344777

Fax: 91-422-2573833

nadarajan_psg@yahoo.co.in

 

 

Abstract

The information revolution is generating mountains of data from sources like astronomy observations, credit card transactions, genetics research, telephone calls, and web clickstreams.  The progress in data mining and knowledge discovery since the first workshop on knowledge discovery in databases has been tremendous.  The research in data mining is developing fast and efficient algorithms to derive knowledge from huge databases. There are several data mining algorithms available to solve diverse data mining problems.  They are mainly classified as associations, classifications, sequential patterns, and clustering.  One of the knowledge and discovery in databases (KDD) operations is the problem of inducing association rules. One of the defining challenges for the KDD research community is scaling up data mining algorithms to mine very large collections of data.  In this paper, we have analyzed graph based association rule mining algorithms like primitive association rule mining, generalized association rule mining and multilevel association rule mining from the perspective of improving scalability and performance.  We have achieved scalability and performance by dividing the transaction database into partitions of equal size and brought one partition at a time to memory.    Primitive association rule mining, which is functionally similar to Apriori, outperforms Apriori.  Generalized association rule mining provides extra knowledge as sibling associations and even cross-parent associations.  Multilevel association rule mining algorithm takes care of analyzing different level wise knowledge.

Keywords:

Data mining algorithms, Association Rule Mining, Graph-based approach, Generalized Association Rule Mining, Multilevel Association Rule Mining.

 

 


1.      Introduction

 

With the rapid growth in size and number of available databases in commercial, industrial, administrative and other applications, it is necessary and interesting to examine how to extract knowledge automatically from huge amount of data [5].  Knowledge discovery in databases, or Data Mining, is the effort to understand, analyze, and eventually make use of huge volume of data available. Through the extraction of knowledge in databases, large databases will serve as a rich, reliable source for knowledge generation and verification and the discovered knowledge can be applied to information management, query processing, decision-making, process control and many other applications.  Therefore, data mining has been considered as one of the most important topics in databases by many database researchers.

 

There are several data mining techniques available to solve diverse data mining problems.  They are mainly classified as associations, classifications, sequential patterns, and clustering [3].  One of the KDD operations is the problem of inducing association rules.

The problem of mining association rules [3] in large databases is to generate all association rules, of the form X ̃ Y, that have support and confidence greater than the user-defined minimum support and minimum confidence.  Formally, Let I= {i1,i2,….in} be a set of items.  Let D be our database, where each transaction T is a set of items such that T < I.  We say that a transaction T contains X, a set of some items in I, if X Í T.  An association rule is an implication of the form X ̃ Y, where X ̀ I, Y ̀ I, and     X Ç Y =F.  The rule X ̃ Y holds in the database D with confidence c if at least c% of transactions in D that contain X also contain Y.  The rule X ̃ Y has support s in the database if at least s% of transactions in D contains XY.  Several algorithms for finding association rules were developed in the last years.  These algorithms find all the rules that satisfy a given support / confidence criterion.  Most algorithms have a common main structure, a two-step process: finding all sets of items that have support above the given minimum, and generating the desired rules from these itemsets. 

The Apriori algorithm is described as a “fast algorithm for mining association rules” and is based on [1].  The algorithm has the common structure of a two step process:  First all large itemsets that have support greater than the minimum support (minsup), are created incrementally.  L1 , the large itemsets of size 1, is created in the first pass over the data, by simply counting the appearance of each item in the data.  Subsequent L’s are created using their candidates.  The candidates are potential large itemsets of the current size.  The candidates are generated from the previous L using a fast and clever cross-product function.  Then a pass over the data determines the candidate’s support.  All candidates with support > minsup are placed in the next L.  This process stops when Ln is empty.  These large itemsets are then used to produce rules.  Each large itemset is divided to all   Head ̃ Body combinations and each combination is tested for sufficient confidence.  Here confidence is sup(Itemset)/ sup(Head). If the confidence tested is greater than the user defined minsup then the rule Head ̃ Body is valid and is kept in a list.

The disadvantage of this algorithm is number of data scans n where n is the size of large nonempty itemset. Also the number of discovered rules is huge while most of them are non-interesting.  Therefore several improved algorithms were proposed after Apriori for efficiency and scalability. 

One such algorithm is graph-based approach [2] to discover various types of association rules, namely, primitive association rules, generalized association rules and multiple-level association rules.  We would discuss these algorithms under related work.  Since this work uses large main memory, we propose a new method that uses main memory efficiently and thus become scalable.  The rest of this paper is organized as follows. Related work is presented in section 2, discovering scalable graph based approach in section 3 , performance of our data mining algorithm in section 4 and data mining applications in healthcare in section 5.  Finally we conclude the paper and present directions for future research in section 6.

2.      Related Work

 

The graph based method consists of three algorithms namely, finding a primitive association rule, generalized association rule and multilevel association rules [2].  Primitive association rule mining is mining of association rules that describe the association among items of the transaction in the database.  A primitive association pattern is a large itemset in which each item is a database item.  A uniform data mining framework was proposed in [2] for primitive association rule to discover association rules is given below. 

q       Numbering phase: In this phase, all the items are assigned an integer number.

q       Large item generation phase: This phase generates large items and records related information.  A large item is an item whose support is no less than a user specified minimum support.

q       Association graph construction phase: This phase constructs an association graph to indicate the association between large items.

q       Association pattern generation phase: This phase generates all association patterns by traversing the constructed association graph.

q       Association rule generation phase: The association rule can be generated directly according to the corresponding association patterns.

In [2], another algorithm proposed was Generalized Association Pattern Generation.  It discovers all generalized association patterns.  To generate generalized association patterns, one can add all ancestors for each items from concept hierarchy [4] and then apply the algorithm on the extended transactions.

The multiple-level association rules [6] are discovered from a large database of customer transactions in which all items are described by a set of relevant attributes.  Each attribute represents a certain concept and these relevant attributes form a set of multiple-level concepts..  For example, if the “category”, ”content”, and “brand” of an item have the domain values “bread”, ”wheat”, and “Wonder”, respectively, then this item is described as  “Wonder wheat bread” in the database.  Multilevel association graph is constructed by considering one level at a time.

In all the three algorithms in [2], the entire set of transactions is brought into main memory and then the graph is created.  From graph, the association patterns and rules are generated.   For large database this method is not usable.  Hence we propose a method in which the set of transactions is divided into number of partitions and one partition at a time is brought into main memory and the support counts of associations are calculated.  Only the association pattern is kept in the main memory and not the entire set of transactions and hence the method is usable and scalable. 

3.      Proposed Work

 

In this new approach we propose a new framework, which is scalable and efficient.  The entire database is divided into partitions of equal size.  Each partition is considered one at a time by loading the first partition into memory and calculating large itemsets and the corresponding support counts. Then second partition is considered similarly and the cumulative support count is calculated for the cumulative large itemsets. This process is continued for the entire set of partitions and finally we have the whole large itemsets and the corresponding cumulative support counts.  This approach reduces main memory requirement since it considers only a small partition at a time and hence it is scalable for any large size of the database.  We have developed three different algorithms for primitive association rule mining, generalized association rule mining and multilevel association rule mining using partition approach.

Example 3.1 :  Let us consider primitive rue mining for the database in table 1.

 

TID

Itemset

100

I1, I2, I5

200

I2, I4

300

I2, I3

400

I1, I2, I4

500

I1, I3

600

I2, I3

700

I1, I3

800

I1, I2, I3, I5

900

I1, I2, I3

Table1 : Transaction database

 

This table consists of nine transactions.  Let us divide this database into three partitions of equal size say three transactions each.  First we consider first partition only.  That is transactions 100, 200, 300 in first partition.  We consider bit vectors for each these transactions [7,8,9, 10].  Then we find their associations by finding logical AND of these bit vectors namely BV1 ^ BV2, BV1 ^ BV3, BV1 ^ BV4, BV1 ^ BV5. Similarly we find logical ANDing namely,  BV2 ^ BV3, ……., BV4 ^ BV5. We calculate support counts of these results.  This is continued for rest of the partitions.  The bit vector for partitions is given in table below. Here in the table 1, there are 9 transactions and 5 items.  We shall consider the transactions into 3 partitions consisting of 3 transactions each.  We identify bit vectors for the items in each partition as in table 2.

 

ID

Partition   1

Partition 2

Partition 3

I1

1 0 0

1 1 0

1 1 1

I2

1 1 1

1 0 1

0 1 1

I3

0 1 1

0 1 1

1 1 1

I4

0 1 0

1 0 0

0 0 0

I5

1 0 0

0 0 0

0 1 0

Table 2: Partitioned database

 

We perform logical AND one partition at a time.  I1 is AND’ed with I2, I3, I4, I5 for the first partition and the result that satisfy minimum support is considered. Then I2 is AND’ed with I3, I4, I5 for the first partition. This is continued for I3, I4 and I5 for the first partition. Once first partition calculation is over, the memory buffer contains large itemsets for first partition with corresponding support count. The above procedure is repeated for all partitions.  And we get cumulative support and large itemsets.  Then we construct the graph and traverse it to find patterns and using these patterns to find rules.

Here, the association patterns of order 3 are I1 I2 I3, I1 I2 I5 and the rules are       I1 ̃ I2 I3, I1 ̃ I2 I5, I1 I2 ̃ I3,          I1 I2 ̃ I5,  etc.   The corresponding graph in figure 1.

The generalized association rule mining is done in a similar manner using small partitions of the database into main memory one at a time.  Here we maintain a separate list of parents for the concept hierarchy tree.  Using this parent list the parents of each item are added into the transactions and this updated transaction is mined for generalized association rules.

            Figure 1 : Association graph

Example 3.2

Let us consider the concept hierarchy for the database..

Here, the items including parents are I1, I2, …., I13. We consider the extended transactions after adding parents from the parents list. Here we represent using binary vectors of each item.  After logical ANDing of I1^I2, I1^I3, ….., I12^13, we get the associated graph.

While constructing the graph there is no edge from a child to a parent and vice versa.   Then from the graph traversal is done to construct the patterns (like (I6,I10,I11) & (I7,I10,I11) ) and they are used for finding generalized association rules (like I6 ̃ I10,I11 and  I7̃ I10,I11  etc.).

The third algorithm multilevel association rule mining is considered for each level separately.  For example if the database considers a three level items namely ‘category’, ‘content’ and ‘brand’ represents for example “Wonder wheat bread” where category is bread, content is wheat and brand is Wonder.  Here the association rule is constructed level wise.  For example category wise irrespective of content and brand, or category and content wise irrespective of brands or by considering all three levels namely category, content and brand.

We can either represent this database as level-3 (ex. 111) or level-2 (ex. 11*) or level-1 (ex. 1**). Here after identifying the level, level wise database is partitioned into small partitions, which are brought to memory one at a time and the corresponding large itemsets with support counts, are identified.  Similarly this is repeated for second partition, etc. At the end of last partition the graph is constructed by taking into account the cumulative itemsets with cumulative support counts. 

The graphs are traversed for finding the patterns (ex. 11* 12* 22*, 11* 21* 22* etc.). Then rules are constructed (like    11* ̃ 12*, 22* and 11* ̃ 21*, 22*  etc.).  Other obvious rules are 111 ̃ 211 and 1** ̃ 2**.

 

4.      Performance Evaluation

 

For each algorithm, five different databases were considered with sizes of 10K, 20K, 30K, 40K and 50K transactions of synthetic data.  First we compare graph mining method with our proposed partitioned graph mining.  The performance shows encouraging results as in Figure 2 - 6.

            Figure 2 : GM vs. PGM

Here the x-axis shows size of data base in number of records and y-axis shows the execution time in seconds.

Next, the generalized graph mining is compared with partitioned generalized graph mining, which is shown in Figure 3.

             Figure 3: GGM Vs. PGGM

Multilevel association rule mining using graph based approach and partitioned approach are compared level-by-level.  In all these levels, the results are encouraging.  The level-1, level-2 and level-3 graphs are shown in Figures 4-6.

              Figure 4 : Level 1 Graph

                 Figure 5 : Level 2 Graph

 

 

            Figure 6: Level 3 Graph

In all the above performances show partitioned graph approach is much better than graph based mining of association rules and hence it is efficient and scalable.

5.      Applications in Healthcare

Today, the major hospitals are corporate hospitals with computerized information in place.  The disease of the patients, treatment history, medicines are all available in the database for each patient. We can mine this database for possible association of diseases and symptoms using association rule mining illustrated in this paper.  Also the considering the medicines as the items purchased, the shopping behaviors can be analyzed.  Other data mining techniques like clustering, classification; outlier analysis can also be applied suitably for the medical database.  Also the bio-informatics research is fast growing with integration of data mining techniques like in the fields of drug design etc.

 

6.      Conclusion and Future Work

 

Data Mining, recently, is becoming much more important as the number of databases and database size keeps growing.  Researchers in many different fields have shown great interest in data mining.  Different data mining tasks are association rule mining, classification, clustering, sequential pattern mining, OLAP and Web mining.  The task of association rule mining is to find certain association relationships among a set of objects in a database.  The association relationships are described in association rules.  There are different algorithms for association rule mining namely, apriori, fp-growth, graph based association rule mining.  Here we have discussed the limitation of graph based association rule mining namely memory usage.  Our approach divides the given database into number of partitions and working with the partitions one at a time in the memory and thus avoids bringing the whole database into memory.  We have analyzed and applied this approach into all the three kind of graph based association rule mining namely primitive association rule mining (PARM), generalized association rule mining (GARM) and multilevel association rule mining (MARM).  In all the cases, the performance of partitioned based approach outperforms graph-based approach mentioned in [2].

 

As all the databases are transaction databases, as a future improvement, we have plans to analyze the raw transaction database and convert into an optimized database using transaction database mining language, which we are working concurrently.  Also we would like to apply these algorithms to suitable applications with real time data.

 

Acknowledgements:

The authors express sincere thanks to the management of PSG College of Technology, Coimbatore to carry out this research by providing all resources.  The first author expresses his gratitude to his supervisor for his support in this work.   Also he thanks the management of Kumaraguru college of Technology for permitting him to do Ph.D. programme.

 

7.      References

1. R.Agrawal, R.Srikant, “Fast Algorithms for mining Association Rules”, Proceedings of the 20th VLDB Conference, pp.487-499, Santiago, Chile, 1994.

 

2. Show-Jane Yen and Arbee L.P.Chen, “A Graph based Approach for Discovering Various Types of Association Rules”, IEEE Transaction on Knowledge and Data Engineering, Vol.13, No.5, pp. 839-845, Sep/Oct. 2001.

 

3. Jiawei Han, Micheline Kamber, “Data Mining Concepts and Techniques”, Morgan Kauffman Publishers, San Francisco 2001.

 

4. R.Srikant and R.Agrawal, “Mining Generalized Association rules”, In Proc. Of the 21st Int. Conf. on Very Large Databases, pp.407-419, Zurich, Switzerland, 1995.

 

5. Willi Klosgen and Jan M. Zytkow, “Hand Book of Data Mining and Knowledge Discovery”, Oxford University Press, 2002.

 

6. Jiawei Han, Yongjian Fu, “Mining Multiple Level Association Rules in Large Databases”, Proc. of 1995 Int'l Conf. on Very Large Data Bases (VLDB'95), pp. 420-431, Zurich, Switzerland, 1995.

 

7. Muthukumar A, Nadarajan R, “An Efficient Approach to Mining Association Rules in Transaction Databases”, Indian Journal of Information Science and Technology, Irofis Publications, Vol. 1, No. 1, pp. 10-17, Nov. 2005.

 

8. Muthukumar A, Nadarajan R, “ A Novel Approach for Mining Association Rules in Dynamic Databases”, Proc. Of  National Conference on Advanced Computing (NCAC2006), MIT, Anna University, Chennai,  pp.142-147, Feb. 2006.

 

9. Jay Ayres, Johannes Gehrke, Tomi Yiu, Jason Flannick, “Sequential Pattern Mining  using a bit map representation”, KDD 02, pp. 429-435, 2002.

 

10. Mohammed J. Zaki, Ching-Jui Hsiaso, “CHARM : An Efficient Algorithm for Closed Itemset Mining”, SDM 2002.


eXTReMe Tracker

Technical College - Bourgas,

All rights reserved, © March, 2000